KhulnaSoft AI Compute Grid - Deploy#1
Conversation
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 57 minutes and 46 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughA comprehensive new system for a production-ready hybrid GPU compute grid is introduced, including deployment automation (GitHub Actions), infrastructure provisioning (Terraform/Ansible/Kubernetes), multi-agent orchestration, distributed job scheduling, cost tracking, real-time monitoring APIs (FastAPI/WebSocket), governance/policy enforcement, and an interactive React dashboard. Changes
Estimated code review effort🎯 5 (Critical) | ⏱️ ~90 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 17
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
🟡 Minor comments (13)
agents/types/base_agent.py-12-12 (1)
12-12:⚠️ Potential issue | 🟡 MinorFix non-optional type hints for parameters defaulting to
None.Parameters at lines 12 and 19 have
Dict[str, Any]type hints but default toNone, which weakens the type contract. UseOptional[Dict[str, Any]](orDict[str, Any] | Nonein Python 3.10+) to properly indicate they acceptNone.Typing fix
-from typing import Dict, Any +from typing import Dict, Any, Optional class Agent(abc.ABC): @@ - def __init__(self, agent_id: str, role: str, config: Dict[str, Any] = None): + def __init__(self, agent_id: str, role: str, config: Optional[Dict[str, Any]] = None): @@ - def run(self, input_data: Any, context: Dict[str, Any] = None) -> Dict[str, Any]: + def run(self, input_data: Any, context: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agents/types/base_agent.py` at line 12, The __init__ signature in BaseAgent uses non-optional Dict[str, Any] type hints for parameters that default to None; update those parameters in BaseAgent.__init__ (e.g., the config parameter and the other parameter on the same ctor) to Optional[Dict[str, Any]] or the Python 3.10+ union form Dict[str, Any] | None so the type hint matches the default value of None and static type checkers won't complain.platform/governance/quotas.py-34-38 (1)
34-38:⚠️ Potential issue | 🟡 MinorQuota usage accumulates indefinitely—no reset mechanism.
track_usageincrements a Redis key that never resets. Without a TTL or periodic reset (e.g., monthly), users will eventually exhaust their quota permanently.Would you like me to suggest a mechanism for periodic quota resets (e.g., using Redis key expiry or a scheduled reset job)?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@platform/governance/quotas.py` around lines 34 - 38, track_usage currently calls r.incrbyfloat(f"{self.key_prefix}{user_id}:usage", actual_hours) with no expiry so quota accumulates forever; fix by ensuring a reset mechanism: either set a TTL on the usage key when it's first created (use r.expire or r.set(..., nx=True, ex=MONTH_SECONDS) after incr in track_usage) or implement a separate reset method (e.g., reset_monthly_quota(user_id) or reset_all_quotas()) and run it from a scheduled job/cron to clear or roll over keys; reference the track_usage method, self.key_prefix and the Redis client r when adding the expire or creating the scheduled reset.ui/src/components/GPUHeatmap.tsx-23-26 (1)
23-26:⚠️ Potential issue | 🟡 MinorPotential runtime error if
node.resourcesis undefined.Accessing
node.resources.vram_gbwill throw ifresourcesis missing or null. Also, the GPU is hardcoded to "NVIDIA A10G" regardless of actual hardware.🛡️ Defensive access with actual GPU data
<div className="mt-3 grid grid-cols-2 gap-2 text-[10px]"> - <div className="text-gray-500">VRAM: <span className="text-gray-300">{node.resources.vram_gb}GB</span></div> - <div className="text-gray-500">GPU: <span className="text-gray-300">NVIDIA A10G</span></div> + <div className="text-gray-500">VRAM: <span className="text-gray-300">{node.resources?.vram_gb ?? 'N/A'}GB</span></div> + <div className="text-gray-500">GPU: <span className="text-gray-300">{node.resources?.gpu ?? 'Unknown'}</span></div> </div>🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ui/src/components/GPUHeatmap.tsx` around lines 23 - 26, In GPUHeatmap.tsx fix the unsafe access to node.resources.vram_gb and the hardcoded GPU string: change the JSX that references node.resources.vram_gb to defensively read the value (e.g. optional chaining or a fallback like node.resources?.vram_gb ?? '—') and replace the hardcoded "NVIDIA A10G" with the actual GPU model from the node (e.g. node.resources?.gpu_model ?? node.gpu?.model ?? 'Unknown GPU') so the component won't throw when resources is undefined and displays a sensible fallback.platform/scheduler/state.py-53-58 (1)
53-58:⚠️ Potential issue | 🟡 Minor
update_node_statussilently fails if node doesn't exist.If the node was concurrently deleted,
hgetreturnsNone, and this function does nothing. Callers (like the scheduler after dispatch) may not know the update failed.🛡️ Consider returning success status or logging
def update_node_status(node_id, status): node_data_str = r.hget("grid_nodes", node_id) if node_data_str: node_data = json.loads(node_data_str) node_data["status"] = status r.hset("grid_nodes", node_id, json.dumps(node_data)) + return True + else: + logger.warning(f"Cannot update status for missing node: {node_id}") + return False🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@platform/scheduler/state.py` around lines 53 - 58, The update_node_status function currently does nothing when r.hget("grid_nodes", node_id) is None; change it to return a boolean success flag and surface a warning when the node is missing: after calling r.hget("grid_nodes", node_id) check for None and if missing call a logger (or r.logger) to warn that the node_id was not found and return False; otherwise proceed with json.loads, update node_data["status"], write back with r.hset and return True; keep function name update_node_status and the use of "grid_nodes", r.hget, r.hset, json.loads and json.dumps so callers can detect failure.platform/audit/observer.py-27-31 (1)
27-31:⚠️ Potential issue | 🟡 MinorIncorrect default for missing
created_atmakes runtime always 0.If
created_atis missing, the code defaults totime.time(), makingruntime = time.time() - time.time() ≈ 0. This silently masks the anomaly detection for jobs with missing timestamps.🔧 Suggested fix
# 1. Runtime Anomaly - start_time = job_data.get("created_at", time.time()) + start_time = job_data.get("created_at") + if start_time is None: + logger.warning(f"Job {job_data['id']} missing created_at timestamp") + continue runtime = time.time() - start_time🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@platform/audit/observer.py` around lines 27 - 31, The code uses time.time() as the default for job_data["created_at"], which makes runtime ≈ 0 when the timestamp is missing and hides anomalies; change the logic in the runtime check (variables: start_time, job_data, runtime, MAX_RUNTIME_SECONDS, logger.warning) to detect a missing or invalid created_at (e.g., if "created_at" not in job_data or falsy), log a distinct warning about the missing timestamp, and then set start_time to 0 (or another sentinel) so runtime = now - 0 yields a large value that will trigger the existing anomaly handling when appropriate.ui/src/index.css-1-3 (1)
1-3:⚠️ Potential issue | 🟡 MinorResolve Stylelint compatibility for Tailwind directives in this file.
Static analysis is currently flagging
@tailwindandtext-renderingvalue casing. Please align Stylelint config/rules (or local suppressions) so this stylesheet passes lint consistently.Also applies to: 15-15
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ui/src/index.css` around lines 1 - 3, Stylelint is tripping on Tailwind at-rules and the casing of the text-rendering value; update the Stylelint config so `@tailwind/`@apply/@variants/@responsive/@screen are allowed by adding them to the at-rule-no-unknown ignoreAtRules list (or install stylelint-config-tailwindcss), and relax the value-casing rule for text-rendering by either disabling value-keyword-case/declaration-property-value-case for that property or adding an exception (e.g., allow "optimizeLegibility") so index.css's `@tailwind` directives and text-rendering values pass linting.ui/src/components/UserQuotas.tsx-14-15 (1)
14-15:⚠️ Potential issue | 🟡 MinorClamp progress width to avoid overflow artifacts.
Line 14–15 can exceed 100% (or go negative), which can break bar rendering at Line 36 and Line 49.
Suggested fix
- const gpuProgress = (MOCK_QUOTAS.used / MOCK_QUOTAS.limit) * 100; - const costProgress = (MOCK_QUOTAS.cost_used / MOCK_QUOTAS.cost_limit) * 100; + const clamp = (v: number) => Math.max(0, Math.min(100, v)); + const gpuProgress = clamp((MOCK_QUOTAS.used / MOCK_QUOTAS.limit) * 100); + const costProgress = clamp((MOCK_QUOTAS.cost_used / MOCK_QUOTAS.cost_limit) * 100);Also applies to: 36-36, 49-49
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ui/src/components/UserQuotas.tsx` around lines 14 - 15, gpuProgress and costProgress can go below 0 or above 100 and break the progress bar rendering; clamp both computed values (gpuProgress and costProgress) to the 0–100 range before they are used (e.g., wrap the existing expressions with Math.min(Math.max(..., 0), 100)) and ensure any places that compute or pass progress (the calculations using MOCK_QUOTAS.used/limit and MOCK_QUOTAS.cost_used/cost_limit and where they are rendered) use the clamped values to prevent overflow/negative widths.ui/src/components/JobGantt.tsx-42-44 (1)
42-44:⚠️ Potential issue | 🟡 MinorTooltip status text is incorrect for active jobs.
Line 43 always shows
Result: Success, including whenjob.statusisrunning. This is misleading state reporting in the dashboard.Suggested fix
- ID: {job.id} | Result: Success + ID: {job.id} | Status: {job.status}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ui/src/components/JobGantt.tsx` around lines 42 - 44, Tooltip currently hardcodes "Result: Success" in JobGantt (the div that shows ID: {job.id}), causing incorrect display for running jobs; change the tooltip to derive its text from job.status (and job.result if available): when job.status === 'running' show "Status: Running", when job.status === 'completed' show "Result: Success" (or use job.result if provided), and when job.status === 'failed' show "Result: Failed"; update the JSX in JobGantt to conditionally render this text using job.status/job.result instead of the fixed "Result: Success".platform/scheduler/cost_engine.py-17-19 (1)
17-19:⚠️ Potential issue | 🟡 MinorClamp negative durations before billing.
Line 17–19 can compute negative cost if
start_timeis in the future (clock skew/bad input), which could reduce total spend incorrectly.🛡️ Small guard
- duration_seconds = time.time() - start_time + duration_seconds = max(0.0, time.time() - start_time)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@platform/scheduler/cost_engine.py` around lines 17 - 19, The billing code in cost_engine.py computes duration_seconds, duration_hours and cost from start_time and can yield negative values if start_time is in the future; clamp the computed duration_seconds to zero before converting to hours and multiplying by PRICE_PER_HOUR (e.g., set duration_seconds = max(0, time.time() - start_time)) so that duration_hours and cost are never negative while preserving the existing variable names duration_seconds, duration_hours, cost and constant PRICE_PER_HOUR.ui/src/components/GridDashboard.tsx-53-57 (1)
53-57:⚠️ Potential issue | 🟡 Minor
selectedNodemay become stale when nodes are replaced via WebSocket.
selectedNodeis initialized withnodes[0]from the hardcoded initial state. WhensetNodes(data.nodes)replaces the array with server data,selectedNodestill references the old object. Consider updatingselectedNodewhen nodes change to keep it in sync.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ui/src/components/GridDashboard.tsx` around lines 53 - 57, selectedNode is initialized from the initial nodes array and can become stale when setNodes replaces the nodes array (e.g., from WebSocket updates); update selectedNode whenever nodes change by adding effect logic: watch nodes (the state from useState) and call setSelectedNode to a matching node from the new array (by id) or fallback to nodes[0] or undefined as appropriate. Locate the useState declarations for nodes, setNodes, selectedNode, setSelectedNode in GridDashboard.tsx and implement a useEffect that syncs selectedNode with the updated nodes array to avoid referencing the old object.platform/governance/policy_engine.py-43-47 (1)
43-47:⚠️ Potential issue | 🟡 MinorString containment check for restricted params is easily bypassed.
Simple substring matching (line 46) can be bypassed with whitespace variations (
rm -rf /), URL encoding, or command splitting. Consider using a more robust pattern-matching approach or command parsing.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@platform/governance/policy_engine.py` around lines 43 - 47, The current substring check using params_str = str(step.get("params", "")) against RESTRICTED_PARAMS can be bypassed by whitespace, URL-encoding, or tokenization tricks; update the check in the policy_engine to normalize and parse params before matching: decode URL-encoding, collapse repeated whitespace, then tokenize using a shell-aware parser (e.g., shlex) or apply robust regex patterns with word boundaries/anchors to match whole commands/tokens rather than substrings, and replace the simple containment test that appends to issues (issues.append(...)) with this normalized/tokenized matching against RESTRICTED_PARAMS.platform/api/ws.py-24-25 (1)
24-25:⚠️ Potential issue | 🟡 Minor
disconnect()may raiseValueErrorif websocket not in list.If
disconnect()is called twice for the same websocket (e.g., from exception handler afterWebSocketDisconnect),list.remove()will raiseValueError. Use a safe removal pattern.Proposed fix
def disconnect(self, websocket: WebSocket): - self.active_connections.remove(websocket) + try: + self.active_connections.remove(websocket) + except ValueError: + pass # Already removed🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@platform/api/ws.py` around lines 24 - 25, The disconnect method currently calls self.active_connections.remove(websocket) which will raise ValueError if the websocket isn't present; update the WebSocketSessionManager.disconnect (or disconnect) implementation to remove the socket safely—either check "if websocket in self.active_connections: self.active_connections.remove(websocket)" or wrap remove in try/except ValueError and ignore missing entries—so repeated calls (e.g., after WebSocketDisconnect) won't raise.ui/src/components/GridDashboard.tsx-61-89 (1)
61-89:⚠️ Potential issue | 🟡 MinorWebSocket lacks error handling and reconnection logic.
The WebSocket connection has no
onerrorhandler and no reconnection strategy. If the connection fails or drops, the dashboard will silently stop receiving updates without user feedback.Proposed fix with error handling and reconnection
useEffect(() => { + let reconnectTimeout: NodeJS.Timeout | null = null; + + const connect = () => { const wsUrl = `ws://${window.location.hostname}:8000/ws`; ws.current = new WebSocket(wsUrl); ws.current.onmessage = (event) => { // ... existing handler }; + + ws.current.onerror = (error) => { + console.error('WebSocket error:', error); + }; + + ws.current.onclose = () => { + console.log('WebSocket closed, reconnecting in 3s...'); + reconnectTimeout = setTimeout(connect, 3000); + }; + }; + + connect(); return () => { + if (reconnectTimeout) clearTimeout(reconnectTimeout); ws.current?.close(); }; }, []);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ui/src/components/GridDashboard.tsx` around lines 61 - 89, The WebSocket in the useEffect currently only sets onmessage and closes on unmount, so add robust error handling and reconnection: attach ws.current.onerror to log/process errors and set a connection state (e.g., connectionStatus via setConnectionStatus), implement ws.current.onclose to trigger a reconnect loop using exponential backoff with a max retry cap, recreate ws.current and rebind onmessage/onerror/onclose when reconnecting, ensure setMetrics, setNodes, and setHistoricalData continue to be used after reconnect, and make sure to cancel pending reconnect timers and close the socket in the cleanup to avoid leaks.
🧹 Nitpick comments (18)
agents/types/specialized.py (1)
8-8: Add Optional typing to parameters with None defaults.Parameters at lines 8, 11, 25, 28, 37, and 40 have
Nonedefaults but non-optional type annotations (Dict[str, Any]). ImportOptionalfromtypingand update these annotations toOptional[Dict[str, Any]]for consistency with Python typing conventions.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agents/types/specialized.py` at line 8, Update type annotations for parameters that have None defaults to use Optional[Dict[str, Any]]: import Optional from typing and change the annotations for the affected parameters (e.g., the __init__ method signature shown and the other functions/methods referenced at lines 11, 25, 28, 37, and 40) from Dict[str, Any] to Optional[Dict[str, Any]] so the type matches the default None; ensure you adjust any corresponding type hints in those function/method definitions (keep parameter names and defaults unchanged).platform/audit/observer.py (2)
22-22: Style: Multiple statements on one line.Per static analysis (E701), refactor to separate lines for readability.
♻️ Suggested fix
raw_data = redis_conn.get(key) - if not raw_data: continue + if not raw_data: + continue job_data = json.loads(raw_data) - if job_data.get("status") != "running": continue + if job_data.get("status") != "running": + continueAlso applies to: 25-25
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@platform/audit/observer.py` at line 22, The single-line statement "if not raw_data: continue" (and the similar instance at the other occurrence) violates E701; split it into two lines by putting the conditional and the continue on separate lines so that the check using raw_data is on one line and the continue is on the next, preserving the surrounding loop context (where the raw_data check occurs) and keeping the same logic in observer.py.
4-4: Unused import:redis.The module imports
redisbut usesredis_connfromplatform.scheduler.stateinstead.♻️ Remove unused import
import time import json import logging -import redis from platform.scheduler.state import r as redis_conn🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@platform/audit/observer.py` at line 4, The file imports the unused symbol redis at the top of observer.py while the module uses the existing redis_conn from platform.scheduler.state; remove the unused import statement "import redis" to clean up the module and avoid unused-import linter errors, leaving any references to redis_conn intact.platform/ai/critic.py (1)
5-7: Avoidlogging.basicConfigat module level.Calling
logging.basicConfig(level=logging.INFO)at import time can override the application's logging configuration if this module is imported before the main app configures logging. This is a common gotcha in library/module code.♻️ Suggested fix
-# Setup logging -logging.basicConfig(level=logging.INFO) logger = logging.getLogger("AICritic")Let the application's entry point configure
basicConfiginstead.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@platform/ai/critic.py` around lines 5 - 7, Remove the top-level call to logging.basicConfig(level=logging.INFO) so this module does not configure global logging on import; keep the logger = logging.getLogger("AICritic") line and optionally add a short comment near logging.basicConfig or above logger noting that logging should be configured by the application entry point (do not call basicConfig inside platform/ai/critic.py).ui/src/components/GPUHeatmap.tsx (1)
3-3: Add TypeScript types for props to improve type safety.Using
anyfor props sacrifices TypeScript's compile-time checks. Consider defining proper interfaces:♻️ Suggested type definitions
interface NodeResources { vram_gb: number; gpu?: string; } interface Node { id: string; type: 'cloud' | 'local'; status: 'busy' | 'ready' | 'idle'; resources: NodeResources; } const NodeChip = ({ node }: { node: Node }) => { // ... }; export default function GPUHeatmap({ nodes }: { nodes: Node[] }) { // ... }Also applies to: 31-31
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ui/src/components/GPUHeatmap.tsx` at line 3, Replace the use of `any` for component props by declaring proper TypeScript interfaces and using them in the component signatures: define interfaces like `NodeResources` and `Node` (with fields vram_gb:number, optional gpu:string, id:string, type:'cloud'|'local', status:'busy'|'ready'|'idle', resources:NodeResources) and change `const NodeChip = ({ node }: any)` to `const NodeChip = ({ node }: { node: Node })`, and likewise update `GPUHeatmap` to accept `nodes: Node[]`; export or colocate the interfaces so both `NodeChip` and `GPUHeatmap` can reuse them.platform/memory/vector_store.py (1)
24-24: Use collision-safe IDs for findings.The current ID can collide when the same agent writes multiple findings within the same second.
♻️ Proposed fix
import logging import json import time +import uuid @@ - "id": f"{agent_id}_{int(time.time())}", + "id": f"{agent_id}_{uuid.uuid4()}",🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@platform/memory/vector_store.py` at line 24, Replace the current collision-prone ID generation ("id": f"{agent_id}_{int(time.time())}") with a collision-safe scheme: generate a UUID (e.g., uuid.uuid4()) and combine it with agent_id (e.g., f"{agent_id}_{uuid4.hex}") or use time.time_ns() if you prefer monotonic nanosecond timestamps; update the ID assignment where the dict is built and add the necessary import for uuid (or use time.time_ns()) so multiple findings from the same agent in the same second cannot collide.platform/scheduler/heartbeat_monitor.py (1)
24-33: Isolate per-job decode/read failures instead of aborting the whole pass.A single bad/missing record currently skips recovery for all other jobs in that cycle.
🛡️ Proposed fix
- for key in all_job_keys: - job_data = json.loads(redis_conn.get(key)) + for key in redis_conn.scan_iter(match="job:*", count=500): + try: + raw = redis_conn.get(key) + if not raw: + continue + job_data = json.loads(raw) + except (TypeError, json.JSONDecodeError): + logger.warning(f"Skipping unreadable job record: {key}") + continue if job_data.get("status") == "running": @@ - except Exception as e: - logger.error(f"Error in recovery loop: {str(e)}") + except Exception: + logger.exception("Error in recovery loop")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@platform/scheduler/heartbeat_monitor.py` around lines 24 - 33, The recovery loop should not abort on a single bad job record; wrap the per-key read/decode and subsequent checks in a per-job try/except so one broken entry doesn't stop the whole pass: when iterating keys, guard the redis_conn.get(key) + json.loads(...) + the status/node_id checks and fail_job call in a try block, log any exception (including the key and error) via logger.error, skip that entry and continue to the next key; also defensively handle redis_conn.get(key) returning None before calling json.loads to avoid a TypeError. Use the existing names (redis_conn.get, json.loads, job_data, active_node_ids, fail_job, logger) to locate and modify the code.ui/src/components/CostTimeline.tsx (1)
14-14: Replaceany[]with a concrete point type.Line 14 loses static guarantees for chart input shape; a typed
CostPoint[]avoids silent runtime mismatches.Suggested fix
- export default function CostTimeline({ data = MOCK_COST_DATA }: { data?: any[] }) { + type CostPoint = { time: string; spend: number }; + export default function CostTimeline({ data = MOCK_COST_DATA }: { data?: CostPoint[] }) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ui/src/components/CostTimeline.tsx` at line 14, The prop type for CostTimeline currently uses any[] which loses type safety; change the component signature to use a concrete CostPoint[] (e.g., CostTimeline({ data = MOCK_COST_DATA }: { data?: CostPoint[] })) and add or import a CostPoint definition matching the chart shape (fields like timestamp/date, cost, category, etc.) — update MOCK_COST_DATA to be typed as CostPoint[] and ensure any functions inside CostTimeline (parsers, mappers) use CostPoint to preserve static guarantees.ui/src/components/UserQuotas.tsx (1)
58-59: Use quota source-of-truth for RPM text.Line 58 hardcodes
60 requests/mininstead of readingMOCK_QUOTAS.rpm_limit, which risks UI/data drift.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ui/src/components/UserQuotas.tsx` around lines 58 - 59, The UI hardcodes "60 requests/min" in the UserQuotas component instead of using the canonical quota value; update UserQuotas.tsx to read the RPM display from the source-of-truth (MOCK_QUOTAS.rpm_limit) and render that value (e.g., `${MOCK_QUOTAS.rpm_limit} requests/min`) so the text stays in sync with the data model; reference the UserQuotas component and the MOCK_QUOTAS.rpm_limit symbol and ensure any necessary import or prop access is used to obtain the value before rendering.ui/src/components/MemoryGraph.tsx (1)
13-17: ReplaceCOLORS: anywith a typed map to restore compile-time safety.
anyon line 13 disables type checking forCOLORS[entry.type]on line 49. While the current data inMOCK_MEMORYonly uses valid type values ('exploit', 'recon', 'intel'), using a typed map prevents invalid keys from being accessed without detection.♻️ Suggested typing change
-const COLORS: any = { +type MemoryType = 'exploit' | 'recon' | 'intel'; +const COLORS: Record<MemoryType, string> = { exploit: '#f43f5e', // rose-500 recon: '#3b82f6', // blue-500 intel: '#8b5cf6', // violet-500 };🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ui/src/components/MemoryGraph.tsx` around lines 13 - 17, Replace the untyped COLORS constant with a properly typed map so accesses like COLORS[entry.type] are checked at compile time; update COLORS to use a Record or mapped type keyed by the allowed entry types (e.g., 'exploit'|'recon'|'intel') or reference the existing union/enum for memory entry types, keeping the same color string values so code using COLORS (e.g., COLORS[entry.type] in MemoryGraph) benefits from static type safety.platform/api/monitor.py (2)
12-12: Move import to top of file.PEP 8 recommends placing all imports at the top of the module. This import is placed after module-level variable assignments.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@platform/api/monitor.py` at line 12, The import of get_active_nodes from platform.scheduler.state is placed after module-level variable assignments; move this import statement up into the main import block at the top of the module so all imports (including "from platform.scheduler.state import get_active_nodes") appear before any module-level variables or code execution to comply with PEP8.
31-32: Exception handling swallows HTTP status context.Returning a dict with
status: "error"on any exception results in a 200 OK HTTP response, making it harder for clients to detect failures. Consider raising anHTTPExceptionto return appropriate status codes.Proposed fix
+from fastapi import APIRouter, Depends, HTTPException ... except Exception as e: - return {"status": "error", "message": str(e)} + raise HTTPException(status_code=500, detail=str(e))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@platform/api/monitor.py` around lines 31 - 32, The except block that currently does "except Exception as e: return {'status': 'error', 'message': str(e)}" swallows HTTP status context and returns 200; change it to raise a FastAPI HTTPException instead (import HTTPException and appropriate status from fastapi/status) inside the same except Exception as e handler so the endpoint returns a non-200 status (e.g., status.HTTP_500_INTERNAL_SERVER_ERROR) and include str(e) as the detail; update the handler surrounding the function (the endpoint function in platform/api/monitor.py) to raise HTTPException rather than returning the dict.runtime/dsl/engine.py (3)
3-3: Unused import:json.The
jsonmodule is imported but never used in this file.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@runtime/dsl/engine.py` at line 3, The import statement "import json" in runtime/dsl/engine.py is unused and should be removed; delete the unused import line (the "import json" symbol) to clean up the module and avoid linter warnings.
70-77:run_stepis mocked and doesn't integrate with Ray adapter.The
run_stepmethod simulates execution withtime.sleep(1)instead of callingrun_workflow_step_on_rayfromruntime/dsl/ray_adapter.py. The docstring acknowledges this is a mock, but consider adding a TODO or feature flag to enable Ray integration.Proposed integration with Ray adapter
+from runtime.dsl.ray_adapter import run_workflow_step_on_ray + +USE_RAY = False # Feature flag for Ray execution + def run_step(self, name, agent, params): """ - Mock execution of a step. - In production, this calls Ray tasks. + Executes a workflow step, optionally on Ray. """ - # Simulate work - time.sleep(1) - return {"last_action": f"finished {name}", "data": f"processed_{name}"} + if USE_RAY: + return run_workflow_step_on_ray(agent, name, params) + else: + # Mock execution for testing + time.sleep(1) + return {"last_action": f"finished {name}", "data": f"processed_{name}"}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@runtime/dsl/engine.py` around lines 70 - 77, The run_step method currently mocks execution with time.sleep(1) and returns fake results; replace or conditionally delegate to the Ray integration by calling the run_workflow_step_on_ray adapter function from runtime/dsl/ray_adapter.py (or add a feature-flag/ENV check) so real Ray tasks are executed when enabled: update run_step to import and call run_workflow_step_on_ray(agent, name, params) when the Ray flag is set (fall back to the existing mock for local/testing), and add a clear TODO comment and unit test path for both code paths; reference the run_step method and run_workflow_step_on_ray symbol to locate where to change behavior.
13-17:StepStatusenum is defined but never used.The
StepStatusenum is declared but the workflow engine uses string literals for status tracking instead.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@runtime/dsl/engine.py` around lines 13 - 17, The StepStatus enum is declared but unused; update the workflow engine to use StepStatus members instead of raw string literals for status tracking (e.g., replace occurrences of "pending"/"running"/"completed"/"failed" with StepStatus.PENDING/StepStatus.RUNNING/StepStatus.COMPLETED/StepStatus.FAILED or their .value where a string is required). Search for status reads/writes in functions like any step execution, status update, and serialization logic and switch them to use the StepStatus enum (or call .value when persisting/serializing) so the enum is actually used consistently across the engine.platform/governance/policy_engine.py (1)
11-11:MAX_JOBS_PER_SESSIONis defined but never used.This constant is declared but not enforced anywhere in the policy engine.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@platform/governance/policy_engine.py` at line 11, MAX_JOBS_PER_SESSION is defined but never enforced; update the policy engine logic that creates or enqueues jobs (e.g., functions/methods like add_job, schedule_job, enqueue_job, or PolicyEngine.process_jobs) to check the current session job count against MAX_JOBS_PER_SESSION and reject or defer new jobs when the limit is reached. Implement the check to either raise a clear exception or return an error status/log (use existing process_logger or error handling conventions), and add unit tests to cover hitting the limit to ensure the constant is actually applied.platform/api/ws.py (1)
53-53: Useasyncio.get_running_loop()instead of deprecated pattern.
asyncio.get_event_loop().time()is deprecated in Python 3.10+. Useasyncio.get_running_loop().time()in async contexts.Proposed fix
- "timestamp": asyncio.get_event_loop().time() + "timestamp": asyncio.get_running_loop().time()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@platform/api/ws.py` at line 53, Replace the deprecated asyncio.get_event_loop().time() call used to populate the "timestamp" value with asyncio.get_running_loop().time() in the async WebSocket message construction (the dict key "timestamp" in platform.api.ws); update the expression where "timestamp" is set so it calls get_running_loop().time() to avoid the deprecated pattern in async contexts.runtime/dsl/ray_adapter.py (1)
25-36: Transient actor per step is inefficient; actors are never cleaned up.Creating a new
AgentExecutoractor for each workflow step (line 31) incurs significant overhead in Ray. Actors are designed to be reused across multiple calls. Additionally, the actor is never explicitly terminated, which may cause resource leaks in long-running workflows.Consider using an actor pool or reusing actors across steps with the same
agent_type.Proposed fix with actor cleanup
def run_workflow_step_on_ray(agent_type, step_name, params): """ Helper to dispatch a DSL step to a Ray Actor. """ + executor = None try: # Create a transient actor for the step executor = AgentExecutor.remote(agent_type, params) future = executor.execute_task.remote(step_name, params) return ray.get(future) except Exception as e: - logger.error(f"Ray execution error: {str(e)}") + logger.error("Ray execution error: %s", e) raise + finally: + if executor is not None: + ray.kill(executor)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@runtime/dsl/ray_adapter.py` around lines 25 - 36, The function run_workflow_step_on_ray creates a new AgentExecutor actor for every call and never cleans it up; instead implement a simple actor cache/registry keyed by agent_type so AgentExecutor actors are created once and reused (lookup-or-create an actor for agent_type, call executor.execute_task.remote(step_name, params), and ray.get the future), and add a module-level shutdown/teardown function that iterates the cached actors and calls ray.kill (or the appropriate Ray termination API) to explicitly terminate them; reference AgentExecutor, run_workflow_step_on_ray, execute_task and the actor cache/registry when making the changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 997bba79-d76f-4657-9204-36fb09e66232
📒 Files selected for processing (56)
.github/workflows/deploy.yml.gitignoreREADME.mdagents/example_agent.pyagents/types/base_agent.pyagents/types/specialized.pyinfra/ansible/gpu-node.ymlinfra/k8s/crds/aijob.yamlinfra/k8s/gpu/device-plugin.yamlinfra/k8s/ray/head.yamlinfra/k8s/ray/worker.yamlinfra/k8s/scaling/keda.yamlinfra/terraform/cloud/main.tfplatform/ai/critic.pyplatform/ai/workflow_generator.pyplatform/api/auth.pyplatform/api/db.pyplatform/api/main.pyplatform/api/monitor.pyplatform/api/requirements.txtplatform/api/ws.pyplatform/audit/observer.pyplatform/governance/policy_engine.pyplatform/governance/quotas.pyplatform/intelligence/adaptive_weights.pyplatform/memory/vector_store.pyplatform/operator/controller.pyplatform/scheduler/autoscaler.pyplatform/scheduler/cost_engine.pyplatform/scheduler/heartbeat_monitor.pyplatform/scheduler/lifecycle.pyplatform/scheduler/locks.pyplatform/scheduler/resource_manager.pyplatform/scheduler/scheduler.pyplatform/scheduler/state.pyruntime/dsl/dag_manager.pyruntime/dsl/engine.pyruntime/dsl/ray_adapter.pyruntime/ray/worker_task.pyui/index.htmlui/package.jsonui/postcss.config.jsui/src/App.tsxui/src/components/AdaptivePanel.tsxui/src/components/CostTimeline.tsxui/src/components/DecisionTimeline.tsxui/src/components/GPUHeatmap.tsxui/src/components/GridDashboard.tsxui/src/components/JobGantt.tsxui/src/components/MemoryGraph.tsxui/src/components/NodeDrillDown.tsxui/src/components/PolicyGuardStatus.tsxui/src/components/UserQuotas.tsxui/src/index.cssui/src/main.tsxui/tailwind.config.js
💤 Files with no reviewable changes (1)
- .gitignore
| build-and-push: | ||
| runs-on: ubuntu-latest | ||
| steps: | ||
| - uses: actions/checkout@v3 |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat .github/workflows/deploy.ymlRepository: CtxOS/ctxpod
Length of output: 1372
🌐 Web query:
actions/checkout v4 latest version 2024 2025
💡 Result:
The latest version of actions/checkout v4 is v4.3.1, released on 2025-11-17. This ports cleanup changes from v6 to the v4 branch. Prior versions include v4.3.0 (2025-08-11, documentation and dependency updates), v4.2.2 (2024-10), v4.2.0 (2024-09-25), and v4.0.0 (2023-10-17, updated to node20 runtime). Use uses: actions/checkout@v4 to pin to the v4 branch, which always points to the latest v4 release. Note that newer major versions like v6.0.2 (2026-01-09) exist with Node.js 24 support.
Citations:
- 1: https://github.com/actions/checkout/blob/v4/CHANGELOG.md
- 2: https://github.com/actions/checkout/commits/v4/
- 3: https://github.com/actions/checkout/releases
- 4: https://github.com/actions/checkout/blob/main/CHANGELOG.md
- 5: https://github.com/actions/checkout/releases/tag/v4.3.1
- 6: https://github.com/actions/checkout/releases/tag/v4.3.0
- 7: https://redirect.github.com/actions/checkout/releases/tag/v4.3.1
🌐 Web query:
docker/setup-buildx-action latest version 2024 2025
💡 Result:
The latest version of the docker/setup-buildx-action GitHub Action is v4.0.0, released on 2026-03-05. This is a major release featuring Node 24 as the default runtime (requires Actions Runner v2.327.1+), removal of deprecated inputs/outputs, switch to ESM, and various dependency updates including @actions/core to 3.0.0 and @docker/actions-toolkit to 0.79.0. Previous recent versions: - v3.12.0 (2025-12-19): Deprecated install input, dependency bumps. - v3.11.1 and earlier in 2025. Usage example from official docs: - name: Set up Docker Buildx uses: docker/setup-buildx-action@v4 Repository last pushed on 2026-04-09 (as of search data), confirming v4.0.0 remains latest. Official Docker docs and GitHub Marketplace also reference v4.
Citations:
- 1: https://github.com/docker/setup-buildx-action
- 2: https://github.com/docker/setup-buildx-action/blob/master/README.md
- 3: https://github.com/docker/setup-buildx-action?search=1
- 4: https://docs.docker.com/build/ci/github-actions/configure-builder/
- 5: https://github.com/marketplace/actions/docker-setup-buildx
- 6: https://github.com/docker/setup-buildx-action/releases
- 7: https://github.com/docker/setup-buildx-action/
- 8: https://github.com/docker/setup-buildx-action/releases/latest
- 9: https://github.com/docker/setup-buildx-action/releases/tag/v4.0.0
🌐 Web query:
docker/login-action latest version 2024 2025
💡 Result:
The latest version of the docker/login-action GitHub Action is v4.1.0, released on April 2, 2026. This is the current latest as of 2026-04-16. Recent releases in 2024-2026: - v4.1.0 (2026-04-02): Fix scoped Docker Hub cleanup path, dependency bumps including @aws-sdk/client-ecr to 3.1020.0 and @docker/actions-toolkit to 0.86.0. - v4.0.0 (2026-03-04): Node 24 as default runtime, switch to ESM, major dependency updates like @actions/core to 3.0.0. Usage example from official docs: - name: Login to Docker Hub uses: docker/login-action@v4.1.0 with: username: ${{ vars.DOCKERHUB_USERNAME }} password: ${{ secrets.DOCKERHUB_TOKEN }} Official repository: https://github.com/docker/login-action
Citations:
- 1: https://github.com/docker/login-action/releases
- 2: https://github.com/docker/login-action
- 3: https://github.com/docker/docker-login-action/
- 4: https://github.com/marketplace/actions/docker-login
- 5: https://docs.docker.com/guides/gha/
- 6: https://github.com/docker/login-action/releases/tag/v4.0.0
- 7: https://redirect.github.com/docker/login-action/releases
🌐 Web query:
docker/build-push-action latest version 2024 2025
💡 Result:
The latest version of the docker/build-push-action GitHub Action is v7.1.0, released on 2026-04-10. This is confirmed by the GitHub releases page listing it as the top release dated 10 Apr (which aligns with 2026-04-10 given today's date of 2026-04-16), and the tags page showing v7.1.0 as the first tag. Previous versions include v7.0.0 (2026-03-05) and v6.19.x in early 2026. In 2024, versions were in the v6.x series (e.g., v6.4.0 in July 2024), and in 2025 leading up to v6.19.0 in February 2026. Official examples in the README and Marketplace use v7.
Citations:
- 1: https://github.com/docker/build-push-action/releases
- 2: https://github.com/docker/build-push-action
- 3: https://github.com/marketplace/actions/build-and-push-docker-images
- 4: https://github.com/docker/build-push-action/releases/tag/v7.0.0
- 5: https://github.com/docker/build-push-action/tags
- 6: https://github.com/docker/build-push-action/releases/tag/v7.1.0
Update GitHub Actions to current major versions.
These pinned action versions are significantly outdated. Update to the latest available versions:
actions/checkout@v3→ usev4or latestv6docker/setup-buildx-action@v2→ usev3or latestv4docker/login-action@v2→ use latestv3orv4docker/build-push-action@v4→ use latestv6orv7
The major versions in the workflow file are no longer current and may face compatibility issues with modern GitHub runners.
🧰 Tools
🪛 actionlint (1.7.12)
[error] 17-17: the runner of "actions/checkout@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue
(action)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/deploy.yml at line 17, Update the pinned GitHub Action
versions to current majors: replace uses: actions/checkout@v3 with a current
major (e.g., actions/checkout@v4 or `@v6`), update docker/setup-buildx-action@v2
to v3 or v4, docker/login-action@v2 to v3 or v4, and docker/build-push-action@v4
to v6 or v7; make these replacements where those action usages
(actions/checkout, docker/setup-buildx-action, docker/login-action,
docker/build-push-action) appear in the workflow so the workflow uses
maintained, compatible major versions.
| ami = "ami-gpu-ubuntu-22" # placeholder for actual GPU AMI | ||
| instance_type = "g5.xlarge" # NVIDIA A10G GPU instance |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify AMI literal shape in Terraform files (read-only).
rg -nP '^\s*ami\s*=\s*"([^"]+)"' infra/terraform/cloud/main.tf
python - <<'PY'
import re, pathlib
p = pathlib.Path("infra/terraform/cloud/main.tf")
text = p.read_text()
m = re.search(r'ami\s*=\s*"([^"]+)"', text)
if not m:
print("No AMI assignment found")
else:
ami = m.group(1)
print("Found:", ami)
print("Valid EC2 AMI pattern:", bool(re.fullmatch(r"ami-[0-9a-fA-F]{8,17}", ami)))
PYRepository: CtxOS/ctxpod
Length of output: 182
Use a real AMI ID instead of the placeholder.
The current value ami-gpu-ubuntu-22 does not match the valid EC2 AMI ID format (pattern: ami-[0-9a-fA-F]{8,17}) and will cause Terraform provisioning to fail.
Suggested fix
resource "aws_instance" "gpu_node" {
- ami = "ami-gpu-ubuntu-22" # placeholder for actual GPU AMI
+ ami = var.gpu_ami_id
instance_type = "g5.xlarge" # NVIDIA A10G GPU instance🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@infra/terraform/cloud/main.tf` around lines 6 - 7, The AMI value is a
placeholder and not a valid EC2 AMI ID, so replace the hardcoded ami =
"ami-gpu-ubuntu-22" with a valid AMI lookup or valid AMI ID; specifically either
set ami to a real AMI string matching ami-[0-9a-fA-F]{8,17} or use a data source
(e.g., data "aws_ami") to look up the latest Ubuntu GPU AMI and reference that
result where the current ami attribute is used alongside instance_type =
"g5.xlarge".
| # In production, these would be in a database or vault | ||
| VALID_API_KEYS = { | ||
| os.getenv("GRID_ADMIN_KEY", "khulnasoft-admin-123"): "admin", | ||
| os.getenv("GRID_AGENT_KEY", "khulnasoft-agent-456"): "agent" | ||
| } |
There was a problem hiding this comment.
Critical: Hardcoded fallback API keys create security vulnerability.
If GRID_ADMIN_KEY or GRID_AGENT_KEY environment variables are not set, the system falls back to predictable hardcoded credentials ("khulnasoft-admin-123", "khulnasoft-agent-456"). An attacker could authenticate with these well-known values.
🔒 Proposed fix to require explicit configuration
-# In production, these would be in a database or vault
-VALID_API_KEYS = {
- os.getenv("GRID_ADMIN_KEY", "khulnasoft-admin-123"): "admin",
- os.getenv("GRID_AGENT_KEY", "khulnasoft-agent-456"): "agent"
-}
+# API keys must be explicitly configured via environment variables
+_admin_key = os.getenv("GRID_ADMIN_KEY")
+_agent_key = os.getenv("GRID_AGENT_KEY")
+
+if not _admin_key or not _agent_key:
+ raise RuntimeError("GRID_ADMIN_KEY and GRID_AGENT_KEY environment variables must be set")
+
+VALID_API_KEYS = {
+ _admin_key: "admin",
+ _agent_key: "agent"
+}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@platform/api/auth.py` around lines 8 - 12, VALID_API_KEYS currently uses
hardcoded fallback values via os.getenv(..., "khulnasoft-...") which creates a
predictable backdoor; update the initialization so os.getenv is called without
default fallback for GRID_ADMIN_KEY and GRID_AGENT_KEY, validate that both are
present, and if either is missing raise a clear runtime/configuration error (or
abort startup) rather than populating VALID_API_KEYS with defaults; reference
the VALID_API_KEYS variable and the environment names GRID_ADMIN_KEY and
GRID_AGENT_KEY and ensure future secrets come from secure config/vault instead
of hardcoded strings.
| from sqlalchemy import create_all, Column, String, Integer, Float, DateTime, JSON, ForeignKey | ||
| from sqlalchemy.ext.declarative import declarative_base | ||
| from sqlalchemy.orm import sessionmaker, relationship | ||
| from sqlalchemy import create_engine |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
Does SQLAlchemy expose create_allfromsqlalchemytop-level, or should table creation be done viaBase.metadata.create_all()?
💡 Result:
SQLAlchemy does not expose a top-level create_all function that can be imported directly from the sqlalchemy package (e.g., from sqlalchemy import create_all). Instead, table creation is performed via Base.metadata.create_all for ORM declarative models or metadata_obj.create_all for Core MetaData objects.
Citations:
- 1: http://docs.sqlalchemy.org/en/latest/core/metadata.html
- 2: https://www.sqlalchemy.org/docs/21/core/metadata.html
- 3: https://docs.sqlalchemy.org/en/stable/tutorial/metadata.html
- 4: https://docs.sqlalchemy.org/en/14/tutorial/metadata.html
- 5: https://docs.sqlalchemy.org/en/latest/core/metadata.html
🏁 Script executed:
# Check if there's a requirements file or setup file with SQLAlchemy version
find . -type f \( -name "requirements*.txt" -o -name "setup.py" -o -name "setup.cfg" -o -name "pyproject.toml" \) | head -20Repository: CtxOS/ctxpod
Length of output: 87
🏁 Script executed:
# Read the db.py file to understand the context
cat -n platform/api/db.py | head -50Repository: CtxOS/ctxpod
Length of output: 2161
🏁 Script executed:
# Search for usage of ForeignKey and relationship in the codebase
rg "ForeignKey|relationship" platform/api/ --max-count 20Repository: CtxOS/ctxpod
Length of output: 241
🏁 Script executed:
# Check SQLAlchemy version
cat platform/api/requirements.txtRepository: CtxOS/ctxpod
Length of output: 86
🏁 Script executed:
# Get full db.py to see if ForeignKey and relationship are actually used
wc -l platform/api/db.pyRepository: CtxOS/ctxpod
Length of output: 77
🏁 Script executed:
# Read the full db.py file
cat -n platform/api/db.pyRepository: CtxOS/ctxpod
Length of output: 2873
🏁 Script executed:
# Check if there's a pyproject.toml or other config that might specify SQLAlchemy version
cat pyproject.toml 2>/dev/null || cat setup.py 2>/dev/null || echo "No standard config found"Repository: CtxOS/ctxpod
Length of output: 80
🌐 Web query:
SQLAlchemy declarative_base import location: sqlalchemy.ext.declarative vs sqlalchemy.orm in SQLAlchemy 1.4 and 2.0
💡 Result:
In SQLAlchemy 1.4, declarative_base is imported from sqlalchemy.orm: from sqlalchemy.orm import declarative_base. The legacy import from sqlalchemy.ext.declarative is deprecated since 1.4, as declarative functionality was integrated into sqlalchemy.orm, emitting deprecation warnings for ext.declarative usage. In SQLAlchemy 2.0, declarative_base is superseded by subclassing DeclarativeBase from sqlalchemy.orm: from sqlalchemy.orm import DeclarativeBase; class Base(DeclarativeBase): pass. The function-based declarative_base is legacy/deprecated, with DeclarativeBase providing better typing support without plugins. Example for 1.4: from sqlalchemy.orm import declarative_base Base = declarative_base Example for 2.0: from sqlalchemy.orm import DeclarativeBase class Base(DeclarativeBase): pass
Citations:
- 1: https://docs.sqlalchemy.org/14/orm/declarative_tables.html
- 2: https://docs.sqlalchemy.org/en/stable/changelog/migration_14.html
- 3: https://www.sqlalchemy.org/docs/14/orm/declarative_styles.html
- 4: https://docs.sqlalchemy.org/14/orm/tutorial.html
- 5: https://docs.sqlalchemy.org/20/orm/declarative_styles.html
- 6: https://docs.sqlalchemy.org/en/stable/changelog/whatsnew_20.html
- 7: https://docs.sqlalchemy.org/20/orm/declarative_tables.html
- 8: https://docs.sqlalchemy.org/en/14/orm/declarative_styles.html
- 9: https://docs.sqlalchemy.org/14/orm/declarative_styles.html
- 10: https://docs.sqlalchemy.org/en/latest/changelog/whatsnew_20.html
- 11: set declarative_base() / registry.declarative_base() as legacy, introduce new base class approach sqlalchemy/sqlalchemy#7562
Fix SQLAlchemy imports; create_all cannot be imported from top-level SQLAlchemy.
The import from sqlalchemy import create_all is invalid and will fail at import time. Additionally, ForeignKey and relationship are imported but never used. The import of declarative_base should also be updated to use the current recommended location in sqlalchemy.orm (the import from sqlalchemy.ext.declarative is deprecated as of SQLAlchemy 1.4).
Proposed fix
-from sqlalchemy import create_all, Column, String, Integer, Float, DateTime, JSON, ForeignKey
-from sqlalchemy.ext.declarative import declarative_base
-from sqlalchemy.orm import sessionmaker, relationship
-from sqlalchemy import create_engine
+from sqlalchemy import Column, String, Integer, Float, DateTime, JSON, create_engine
+from sqlalchemy.orm import declarative_base, sessionmaker🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@platform/api/db.py` around lines 1 - 4, Remove the invalid top-level import
of create_all and the unused ForeignKey and relationship imports, and update the
deprecated declarative_base import to the current location; specifically, in
platform/api/db.py stop importing create_all from sqlalchemy (use
Base.metadata.create_all(engine) instead), replace "from
sqlalchemy.ext.declarative import declarative_base" with "from sqlalchemy.orm
import declarative_base", and remove ForeignKey and relationship from the import
list so you only import Column, String, Integer, Float, DateTime, JSON (and keep
sessionmaker and create_engine) as needed.
| @app.get("/job/{job_id}") | ||
| def get_job_status(job_id: str): | ||
| job_data = r.get(f"job:{job_id}") | ||
| if not job_data: | ||
| raise HTTPException(status_code=404, detail="Job not found") | ||
| return json.loads(job_data) |
There was a problem hiding this comment.
Protect /job/{job_id} with auth and ownership checks.
This endpoint returns the full job payload, including params and user_scope, without authentication. Any caller who knows a job ID can read another tenant's job metadata.
🔐 Minimal authorization guard
`@app.get`("/job/{job_id}")
-def get_job_status(job_id: str):
+def get_job_status(job_id: str, user_scope: str = Depends(get_api_key)):
job_data = r.get(f"job:{job_id}")
if not job_data:
raise HTTPException(status_code=404, detail="Job not found")
- return json.loads(job_data)
+ job = json.loads(job_data)
+ if job.get("user_scope") != user_scope:
+ raise HTTPException(status_code=403, detail="Forbidden")
+ return job🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@platform/api/main.py` around lines 108 - 113, The get_job_status endpoint
currently returns full job payloads without authentication; require
authentication and enforce ownership/tenant checks before returning job_data:
add the existing auth dependency (e.g., get_current_user or your auth
middleware) to the get_job_status route, parse job_data (from r.get) into an
object, compare the job's owner/tenant field (e.g., job_obj["user_id"] or
job_obj["user_scope"]) against the authenticated user's id/scope, and if they
don't match raise HTTPException(401/403). Also consider removing or redacting
sensitive fields like "params" and "user_scope" from the returned payload unless
the authenticated user is authorized to see them. Ensure these checks live in
get_job_status so unauthorized callers cannot fetch another tenant's job.
| time.sleep(2) # Small gap between lock attempts | ||
| time.sleep(5) |
There was a problem hiding this comment.
Fix the stray indentation at Line 93.
The second time.sleep(5) is indented under no enclosing block, so Python raises IndentationError and the scheduler module cannot be imported.
🧹 Minimal fix
- time.sleep(2) # Small gap between lock attempts
- time.sleep(5)
+ time.sleep(2) # Small gap between lock attempts🧰 Tools
🪛 Ruff (0.15.10)
[warning] 93-93: Unexpected indentation
(invalid-syntax)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@platform/scheduler/scheduler.py` around lines 92 - 93, There is a stray
indentation on the second time.sleep(5) call which causes an IndentationError;
locate the two calls to time.sleep (the time.sleep(2) and the indented
time.sleep(5)) and unindent time.sleep(5) so it lines up with time.sleep(2) (or
remove the duplicate sleep if unintended), ensuring both calls are at the same
block level in the scheduler lock/retry logic.
| def get_active_nodes(): | ||
| """ | ||
| Returns a list of all nodes that haven't timed out. | ||
| """ | ||
| all_nodes = r.hgetall("grid_nodes") | ||
| active_nodes = [] | ||
|
|
||
| for node_id, data_str in all_nodes.items(): | ||
| # Check if heartbeat key still exists | ||
| if r.exists(f"node_heartbeat:{node_id}"): | ||
| active_nodes.append(json.loads(data_str)) | ||
| else: | ||
| # Cleanup stale node | ||
| logger.warning(f"Node {node_id} timed out. Removing from registry.") | ||
| r.hdel("grid_nodes", node_id) | ||
|
|
||
| return active_nodes |
There was a problem hiding this comment.
Race condition: concurrent get_active_nodes() calls can corrupt state.
This function is called from multiple concurrent processes (scheduler, heartbeat_monitor, autoscaler per context snippets). The pattern of iterating hgetall, checking exists, and deleting via hdel is not atomic—creating race conditions:
- TOCTOU: Between
r.exists()andr.hdel(), another process may register or use the node - Scheduler dispatch race: The scheduler may select a node that another concurrent call is about to delete
- Lost updates:
autoscaler.pyalso directly callshdelon the same keys
🔒 Consider using a Lua script for atomic cleanup or distributed locking
+CLEANUP_SCRIPT = """
+local all_nodes = redis.call('HGETALL', KEYS[1])
+local active = {}
+for i = 1, `#all_nodes`, 2 do
+ local node_id = all_nodes[i]
+ local data = all_nodes[i+1]
+ if redis.call('EXISTS', 'node_heartbeat:' .. node_id) == 1 then
+ table.insert(active, data)
+ else
+ redis.call('HDEL', KEYS[1], node_id)
+ end
+end
+return active
+"""
+
def get_active_nodes():
- all_nodes = r.hgetall("grid_nodes")
- active_nodes = []
-
- for node_id, data_str in all_nodes.items():
- if r.exists(f"node_heartbeat:{node_id}"):
- active_nodes.append(json.loads(data_str))
- else:
- logger.warning(f"Node {node_id} timed out. Removing from registry.")
- r.hdel("grid_nodes", node_id)
-
- return active_nodes
+ result = r.eval(CLEANUP_SCRIPT, 1, "grid_nodes")
+ return [json.loads(data) for data in result]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@platform/scheduler/state.py` around lines 35 - 51, get_active_nodes currently
does r.hgetall + per-entry r.exists + r.hdel which is a TOCTOU race; replace
that non-atomic pattern by moving the check-and-remove into a Redis Lua EVAL
call so the whole iteration, existence check of node_heartbeat:{node_id},
deletion from hash ("grid_nodes") and collection of active entries run
atomically; update the get_active_nodes function to call that Lua script
(instead of using r.hgetall / r.exists / r.hdel), returning the active node JSON
strings, and keep the same returned structure so callers of get_active_nodes
don't change.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Summary by CodeRabbit
Release Notes
New Features
Infrastructure
Documentation